Skip to main content

Full text of "Statistical Mechanics of Linear and Nonlinear Time-Domain Ensemble Learning"

See other formats


Typeset with jpsj2.cls <ver.l.2> 



Full Paper 



Statistical Mechanics of Linear and Nonlinear 
Time-Domain Ensemble Learning 

Seiji MIYOSHI 1 * and Masato OKADA 234 

1 Department of Electronic Engineering, Kobe City College of Technology, 
8-3 Gakuenhigashimachi, Nishi-ku, Kobe-shi, 651-2194 
2 Division of Transdisciplinary Sciences, Graduate School of Frontier Sciences, 
The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba, 277-8561 
3 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198 

Conventional ensemble learning combines students in the space domain. In this paper, how- 
ever, we combine students in the time domain and call it time-domain ensemble learning. 
We analyze, compare, and discuss the generalization performances regarding time-domain 
ensemble learning of both a linear model and a nonlinear model. Analyzing in the framework 
of online learning using a statistical mechanical method, we show the qualitatively different 
behaviors between the two models. In a linear model, the dynamical behaviors of the gener- 
alization error are monotonic. We analytically show that time-domain ensemble learning is 
twice as effective as conventional ensemble learning. Furthermore, the generalization error 
of a nonlinear model features nonmonotonic dynamical behaviors when the learning rate is 
small. We numerically show that the generalization performance can be improved remarkably 
by using this phenomenon and the divergence of students in the time domain. 

KEYWORDS: ensemble learning, online learning, generalization error, statistical mechanics 



1. Introduction 

Learning can be roughly classified into batch learning and online learning. 1 In batch 
learning, given examples are used more than once. In this paradigm, a student will give the 
correct answers after training if that student has an adequate degree of freedom. However, it 
is necessary to have plenty of time and a large memory for storing many examples. On the 
contrary, in online learning examples used once are then discarded. In this case, a student 
cannot give correct answers for all the examples used in training. There are, however, merits. 
For example, a large memory for storing many examples is not necessary, and it is possible to 
follow a time-variant teacher. 

Recently, we analyzed the generalization performance of some models in a framework of 
online learning using a statistical mechanical method. 2-6 Ensemble learning means to combine 
many rules or learning machines (called students in this paper) that perform poorly; this 
has recently attracted the attention of many researchers. 2, 3,7-10 The diversity or variety of 



*E-mail address: miyoshi@kobe-kosen.ac.jp 



1/11 



J. Phys. Soc. Jpn. Full Paper 

students is essential in ensemble learning. We showed that the three well-known rules, Hebbian 
learning, perceptron learning, and AdaTron learning have different characteristics in their 
affinities for ensemble learning, that is in "maintaining diversity among students" 3 D In that 
process, 13,14 it was subsidiarily proven that in an unlearnable case, 11 ' 12 the student vector 
does not converge in one direction but continues moving. Therefore, we also analyzed the 
generalization performance of a student supervised by a moving teacher that goes around a 
true teacher, 4 proving that the generalization error of a student can be smaller than a moving 
teacher's, even if the student only uses examples from the moving teacher. In an actual human 
society, a teacher observed by a student does not always present the correct answer. In fact, 
many cases the teacher is learning and continues to change. Therefore, analyzing such a model 
is interesting in terms of considering the analogies between statistical learning theories and 
real human society. 

In conventional ensemble learning, generalization performance is improved by combining 
students who have diversities. However, students do not always converge in one direction but 
may continue moving in an unlearnable model. Therefore, generalization performance in such 
a model must be improved by combining students themselves at different times, even if there 
is only one student. Conventional ensemble learning combines students in the space domain. 
In contrast, we introduce a method of combining the students in the time domain, which we 
call "time-domain ensemble learning". 6 

Some studies 15-18 have treated the combining of students in the time domain. We partic- 
ularly pay attention to dynamical behaviors of the generalization performance of the time- 
domain ensemble learning and theoretically analyze it by applying a statistical mechanical 
method. We analytically or numerically obtain, compare, and discuss the order parameters 
and the generalization errors of two models: a linear model in which both teacher and student 
are linear perceptrons 2 with noise and a nonlinear model in which both teacher and student are 
nonlinear perceptrons. The results show that the two models have the qualitatively different 
behaviors. We analytically demonstrate that time-domain ensemble learning of a linear model 
is twice as effective as conventional ensemble learning. Furthermore, we numerically show that 
the generalization performance of a nonlinear model can be ramarkably improved by using 
nonmonotonic dynamical behaviors and the divergence of students in the time domain. 

2. Model 

In this paper we consider a teacher and a student. They are perceptrons with the con- 
nection weights B and J m , respectively, where m denotes the time step. For simplicity, the 
connection weights of the teacher and the student are simply called the teacher and the stu- 
dent. Teacher B = (B lt B N ), student J m = (Jf , . . . , J%), and input x m = (xf, ...,x%) 
are iV-dimensional vectors. Each component B{ of B is independently drawn from Af(0, 1) and 
fixed, where Af(0, 1) denotes a Gaussian distribution with a mean of zero and a variance of 



2/11 



J. Phys. Soc. Jpn. Full Paper 

unity. Each component jf of the initial value J° of J m is independently drawn from JV(0, 1). 
The direction cosine between J m and B is R m and that between J m and J m is q m ' m . Each 
component xf 1 of x m is drawn from JV(0, 1/iV) independently. 

Figure 1 illustrates the relationship among teacher _B, students J m and J m and the 
direction cosines R m ,R m ' , and q m ' m ' . 



B 




/ / i 

Fig. 1. Teacher .B and students J m and J m . R m ,R m , and g m > m are direction cosines. 



In this paper, we also deal with the thermodynamic limit N —* oo. Therefore, \\B\\ = 
y/N, \\J°\\ = y/N, and \\x m \\ = 1. Generally, since the norm II J m \\ of the student changes as 
the time step proceeds, the ratios l m of the norm to \HV are introduced and are called the 
length of the student. That is, || J m \\ = l m ^. 

Linear Case Outputs of the teacher and the student are = v' m +n 7 ^ and of = u m l m +n'j l , 
respectively. Here, v m = B-x m ,u m l m = J m -x m ,n™ ~7^(0,a|) ,nj ~ M (0, aj) , where 
AA(0, a 2 ) denotes a Gaussian distribution with a mean of zero and a variance of a 2 . That 
is, the outputs of the teacher and the student include independent Gaussian noises with 
variances of a 2 B and a 2 , respectively. Then, v m and u m obey Gaussian distributions with 
a mean of zero, a variance of unity, and a covariance of R m . Let us define the error 
between the teacher B and the student J m alone by the squared error of their outputs: 

^ = \ (°S - oj) 2 . (1) 

Student J rn adopts the gradient method as a learning rule and uses input x and an output 
of teacher B for updates. That is, 

am 

J m+1 = jm -VQjk (2) 

+ 7](y + n B — u I —nj)x , (S) 



3/11 



J. Phys. Soc. Jpn. Full Paper 

where n denotes the learning rate of the student and is a constant positive number. The 
part rj {v m + n B — u m l m — n™) has been determined by the learning rule. Generalizing 
this part, we denote it with f m . 

Nonlinear Case The teacher and the student are a nonmonotonic perceptron and a sim- 
ple perceptron, respectively; their outputs are o B = sgn((f m — a)v m (v m + a)) , o™ = 
sgn(u m l m ). Here, v m = B ■ x m and u m l m = J m ■ x m , v m and u m obey Gaussian dis- 
tributions with a mean of zero, a variance of unity, and a covariance of R m , and sgn(-) 
denotes a sign function. Student J m adopts the perceptron learning as a learning rule for 
updates. That is, 

jm+l = j™ + r? e(-0^ y) ^ m , (4) 

where n denotes the learning rate of the student and is a constant positive number, ©■(•) 
denotes a step function. The part ri<d (— o^Oj 1 ) has been determined by the learning 
rule. Generalizing this part, we denote it with f m . 

3. Theory 

3.1 Generalization error 

Conventionally, ensemble learning means to improve performance by combining many 
students that perform poorly. We, however, use just one student and combine copies of it 
(hereafter called "brothers") at different time steps in this paper. Conventional ensemble 
learning combines students in the space domain, whereas, we do so in the time domain. 
In this paper K brothers J mi , J" 12 , . . . , J niK are combined. Here, m\ < m-i < . . . < mx- 
One goal of statistical learning theory is to theoretically obtain generalization errors. Since 
a generalization error is the mean of errors over the distribution of the new input x , the 
generalization errors e 9 of the ensemble in linear and nonlinear cases are calculated as follows: 

Linear Case We use the squared error e for new input x. Here, it is assumed that the 
Gaussian noises are independently added to the teacher and each brother of the ensemble. 
The weight of each brother J nik of the ensemble satisfies Ck > 0. That is, the error of the 
ensemble is 



\(^-x + n B -Y^Ck{J m *-x + n k )^ , 



(5) 



where ree ~ M(0,a B ) and nk ~ A/"(0, o~j). Thus, the generalization error e 9 of the 
ensemble is calculated as follows: 

e 9 = J dxdriB ^ JJ dn^ p(x)p(n B ) ^ JJ p{n k )^ e (6) 
= J dv ( f[ du k j dn B ( f[ dn k j p(v, {u k })p{n B ) ( f[ p(nk) J 



\k=l / \fe=l / \fc=l 



4/11 



J. Phys. Soc. Jpn. Full Paper 

x^u + n B -5^C fc (u fc Z m *+n fc )j (7) 

/ K K K 

= - i -2j2c k r k R mk + zJ2Yl c k c k d m n m « q m ^ m y 

V k=l k=lk'>k 

+ EQ 2 (^) 2 + 4 + EQM), (8) 

k=l k=l / 

where v = B x and u k l mk = J nLk ■ x. We executed integration using the following: v and 
u k obey A/"(0, 1). The covariance between v and u k is i? mfc , and that between u k and u^' 

i S q m k,m k , _ 



Nonlinear Case A majority vote by brothers might be a typical method of combining for a 
nonlinear model in which the output of each student is +1 or —1. However, to simplify 
the analysis we apply the following method: the output of the ensemble is that of a 
new perceptron of which the connection weight is the weighted sum of the normalized 
connection weights J tk /l tk of brothers. That is, 



e = © ^- (v - a) v (v + a) ^ C k Ukj , 



(9) 



where C k > is a weight of each brother J mk in the ensemble. Thus, the generalization 
error e g of the ensemble is expressed as follows: 

dxp{x)e = J dv du k ^j p(v, {u k })Q ^— (v — a) v (v + a)J2C k u k ^j . (10) 

3.2 Differential equations for order parameters, and their solutions 

In this paper, we examine the thermodynamic limit N — ► oo. To do so, updates of Eq.(3) 
or Eq.(4) must be executed O(N) times for the order parameters l,R, and q to change by 
0(1). Thus, the continuous times t±, . . . , tx, which are the time steps mi, . . . , mx normalized 
by the dimension TV, are introduced as the superscripts that represent the learning process. 
To simplify the analysis, we introduced the following auxiliary order parameters r* = 
and Q*'*' = l t l t 'q t ' t ' . The simultaneous differential equations in deterministic forms, 19 which 
describe the dynamical behaviors of order parameters, have been obtained based on the self- 
averaging of thermodynamic limits as follows: 

^ ~ </'«'> + 

£ - (/'»'), (12) 
^ = !'(/''6'>, (13) 



5/11 



J. Phys. Soc. Jpn. Full Paper 

where t' > t and u f = x 1 ' ■ J f //* ~ AA(0, 1). Four sample averages in Eqs.(ll)-(13) are 
obtained by executing integrations where i>* , u* and u* obey the triple-Gaussian distribution 
p(v 1 ' , u 1 ' , u l ), for which the covariance matrix is 

/ 1 R f ' R f \ 

S = R? 1 g*.** . (14) 

\ i?* g*'*' 1 j 
Linear Case The four sample averages can be easily calculated as follows: 

</V> = r?(r*/Z* - /*), (15) 

<(/<) 2 > = v \l + a% + a 2 J + (l t ) 2 -2r t ), (16) 

(/V) = tfl-r'), (17) 

(/*'«*) = r? (r* - Q*'*') /I*. (18) 

Using R° = 0,1° = 1 and Q*'* = (Z*) 2 as initial conditions, we can analytically solve the 
simultaneous differential equations Eqs.(ll)-(13) as follows: 6 

r t = 1 _ e -r,t^ (19) 

{?? = l + ^K + ^)-2e^+(2-^K + a 2 ))e^, (20) 

Q*'*' = 1 - e"* + e-"*' + ((/*) 2 - 1) e"^*'-*) . (21) 

Substituting Eqs.(19)-(21) into Eq.(8), the generalization error e g can be analytically 
obtained as a function of time t^, k = 1, . . . , K. 

Nonlinear Case The four sample averages are obtained as follows: 

^-TsH'^K)- 1 )- 1 )- <22) 

(l/'l-; 2„M /°°d„h( , R '° U/Vfff- , R '" 1 I. (2:1] 



a 



1-iR 1 ) 2 Jo \ Jl-(R^ 2 



</V) = -^(2exp(-^)-l-i?M. (2 1) 



^7T V V 2 

</*V) = - _ 1 - </'V) + (rf - g*'^) </V>) , (25) 

where H{u) = Dx, Dx = -^== exp (—^Pj • The generalization error e g is obtained by 
solving Eqs.(10)-(13), and (22)-(25) numerically. 

4. Results and Discussion 

Figure 1 shows examples of the dynamical behaviors of I and R in a linear model obtained 
analytically, and the corresponding simulation results, where N = 2,000. In the case of a 



6/11 



J. Phys. Soc. Jpn. Full Paper 

linear model, many properties regarding both dynamical behaviors and steady states can be 
analytically proven. 6 For example, I and e g diverge unless < rj < 2. In the case of no 
noise, I asymptotically approaches unity after becoming larger than unity if < rj < 1 and I 
asymptotically approaches unity after becoming smaller than unity if 1 < rj < 2. The larger rj 
is, the faster R rises. However, the convergence of R is fastest when the learning rate satisfies 
rj = 1. This phenomenon can be understood by the fact that rj = 1 is a special condition for 
the gradient method where the student uses up the information obtained from input x. 5 

Figure 2 displays some examples of the dynamical behaviors of I and R in a nonlinear 
model obtained numerically, and the corresponding simulation results, where N = 2, 000. The 
reason why R is negative, which differs from the linear case, is that the threshold a of a 
nonmonotonic teacher is greater than the critical value ac = \/2 In 2 ~ 1.18. 11 This is not 
essential in this paper. When the learning rate rj is relatively large, the dynamical behavior of R 
is monotonic; however, when rj is small, the dynamical behavior of R becomes nonmonotonic. 
That is, \R\ asymptotically approaches a steady value after reaching its maximal one. The 
steady value is not dependent upon rj. 



1.4 
1.2 
1 

0.8 
0.6 
0.4 
0.2 





2 3 
t 



Fig. 2. Dynamical behaviors of I and R in the linear case, rj = 1.0, a 2 B = 0.3, <t 2 j = 0.5, 



Figure 4 (left) presents some examples of the dynamical behaviors of q of a linear model 
obtained analytically, and the corresponding simulation results, where N = 2, 000. For a linear 
model, q increases monotonically when t increases, and increases monotonically when t' — t 
decreases. Furthermore, (?*'*' converges to a smaller value than unity in the case of t' — t 7^ 0.0. 
This means the student itself continues to move after the order parameters reach a steady 
state. Figure 4 (right) shows the relationship between t\ and e g in the case of constant ti — t\ 
and K = 2. Here, e g increases monotonically, remains constant, or decreases monotonically 
depending on the values of rj. 



7/11 



J. Phys. Soc. Jpn. 



Full Paper 



0.5 



-0.5 



10 



20 



l(Eta=1.0) 
1 (Eta=0.5) 
1 (Eta=0.2) 
R(Eta=1.0) 
R (Eta=0.5) 
R (Eta=0.2) 



30 
t 



40 



50 



60 



Fig. 3. Dynamical behaviors of I and R in the nonlinear case, a = 2.0. 



The behaviors of e g when the leading time t\ — > oo and the time interval tk+\ — tk — ► oo 
can be theoretically obtained in the case of a linear model as follows: 6 e g decreases as rj 
decreases regardless of K. When the weights are uniform or Cy. = C = 1/K and K = oo, 
B = limx-»oo ^ J2k=i J tk ■ This m eans the generalization error equals the residual error 
caused by teacher's noise ns- On the other hand, the generalization error e g of K = oo is \ 
times of that of K = 1 when the learning rate satisfies rj = 1, the uniform weights Ct = 1/K, 
a 2 B = a 2 j, t\ — > 00, and — — > 00. Since the generalizaion error e 9 of the conventional 
space-domain ensemble learning with K = 00, rj = 1, C). = l/K and <j 2 b = a 2 is \ times of 
that of K = l, 2 we can say the time-domain ensemble learning is twice as effective as the 
conventional space-domain ensemble learning. This difference can be explained as follows: In 
conventional space-domain ensemble learning, the similarities among students become high 
since all students use the same examples for learning. In time-domain ensemble learning, on 
the other hand, the similarities among brothers become low since all brothers use almost 
totally different examples for learning. 

Figures 5-7 (left) show some examples of the dynamical behaviors of q for a nonlinear 
model obtained numerically, and the corresponding simulation results, where iV = 2,000. 
These figures indicate that q for a nonlinear model behaves nonmonotonically for t when 77 
is small. Figures 5-7 (right) show the relationship between t\ and e g in the case of constant 
t 2 — 1\ and K = 2. The steady value of e g is dependent upon i 2 — 1\ and is not dependent upon 
rj. However, when rj is small, e g behaves nonmonotonically for t\ and has the minimal value 
shown in Figs. 6 (right) and 7 (right). This phenomenon can be considered a kind of over- 
learning. Figure 7 (right) shows that the minimal value of e g decreases when t 2 — t\ increases 
as — > 5 — > 10 and increases when ti — 1\ increases as 10 — > 20. This means that t% — 1\ has an 
optimum value. Figures 6 (right) and 7 (right) reveal that the smaller the learning rate rj is, 



8/11 



J. Phys. Soc. Jpn. 



Full Paper 



t'-t=0.0 
t'-t=0.5 
t'-t=1.0 
t'-t=1.5 
t'-t=2.0 
t'-t=10.0 



b 



1.4 
1.3 
1.2 
1.1 
1 

0.9 
0.8 
0.7 
0.6 
0.5 
0.4 



t2-tl=0.0 




t2-tl=1.0 


_ 


V t2-tl=2.0 


« 


\ t2-tl=5.0 




■ \ t2-tl=10.0 

























1 2 3 4 5 

tl 



Fig. 4. Dynamical behaviors of q and e g in a linear case , rj 
1/K. 



1.0, <r| 



0.3, ct} = 0.5, if = 2, C fc = 



l 

0.8 

0.6 
0.4 
0.2 I 








10 



20 



30 
t 



t'-t=0 
f-t=l 
t'-t=2 
f-t=5 



40 



50 



60 



o 
w 



1 
e 

o 



0.5 
0.45 

0.4 
0.35 

0.3 
0.25 

0.2 
0.15 



t2-tl=0 
t2-tl=5 
t2-tl=10 
t2-tl=20 



10 



20 



30 
tl 



40 



50 



60 



Fig. 5. Dynamical behaviors of q and e g in a nonlinear case , rj — 1.0, a = 2.0, K = 2, Ck = 1/K. 



the smaller the minimal value of e g is. However, if r\ is too small, the learning becomes slow. 
Therefore, if the aim is to decrease the generalization error e g , we should use the smallest rj 
that is possible from the viewpoint of learning speed, set if — t to the optimum value, and 
stop the learning at an adequate time step. 

5. Conclusion 

We have analyzed the generalization performances regarding time-domain ensemble learn- 
ing of both a linear model and a nonlinear model. Analyzing within the framework of online 
learning using a statistical mechanical method, we have demonstrated the qualitatively dif- 
ferent behaviors between the two models. In a linear model, the dynamical behaviors of the 
generalization error are monotonic. We have analytically shown that time-domain ensemble 
learning is twice as effective as conventional ensemble learning. Furthermore, the generaliza- 
tion error of a nonlinear model exhibits nonmonotonic dynamical behaviors when the learning 
rate is small. We have numerically shown that the generalization performance can be remark- 



9/11 



J. Phys. Soc. Jpn. 



Full Paper 



1 






0.5 










0.45 


\ 


0.8 






8 
b 


0.4 


\ t2-tl=0 
i \ t2-tl=5 


0.6 


31 




w 




\ \ t2-tl=10 - 


/ » . a::.: 




| 


0.35 1 










•alizat 






0.4 




f-t=0 * - 


0.3 


-\ y t2-tl=20 


0.2 




f-t=2 
t'-t=5 

t'-t=l() □ " 


Genei 


0.25 
0.2 













0.15 





10 20 30 40 50 60 10 20 30 40 50 

t tl 



Fig. 6. Dynamical behaviors of q and e g in a nonlinear case . r\ = 0.5, a = 2.0, K = 2, Ck = 1/K. 




Fig. 7. Dynamical behaviors of g and e g in a nonlinear case , rj = 0.2, a = 2.0, K = 2, Ck = 1/K. 



ably improved by using this phenomenon together with the divergence of students in the time 
domain. 

Acknowledgments 

This research was partially supported by the Ministry of Education, Culture, Sports, Sci- 
ence, and Technology of Japan, with Grants-in-Aid for Scientific Research 14084212, 15500151, 
16500093, and 18500183. 



10/11 



J. Phys. Soc. Jpn. 



Full Paper 



References 

1) Saad, D. (eds.) (1998) On-line Learning in Neural Networks. Cambridge: Cambridge University 
Press. 

Hara, K. & Okada, M. (2005) Ensemble learning of linear perceptron; Online learning theory. 
Journal of Physical Society of Japan 74:2966-2972. 

Miyoshi, S., Hara, K. & Okada, M. (2005) Analysis of ensemble learning using simple perceptrons 
based on online learning theory. Physical Review E 71:036116. 

Miyoshi, S. & Okada, M. (2006) Analysis of on-line learning when a moving teacher goes around 
a true teacher. Journal of Physical Society of Japan 75(2):024003. 

Miyoshi, S. & Okada, M. (2006) Statistical mechanics of online learning for ensemble teachers. 
Journal of Physical Society of Japan 75(4):044002. 

Miyoshi, S., Uezu, T. & Okada, M. (2006) Statistical mechanics of time domain ensemble learning. 
Journal of Physical Society of Japan 75(8):084007. 

Freund, Y. & Schapire, R.E. (1999) A short introduction to boosting. Journal of Japanese Society 
for Artificial Intelligence 14:771-780. [in Japanese, translation by Abe, N.] 
http://www.boosting.org/ 

Statistical mechanics of ensemble learning. Krogh, A. & Sollich, P. (1997) Physical Review E 
55:811-825. 

Online learning with ensembles. Urbanczik, R. (2000) Physical Review E 62:1448-1451. 
Inoue, J.I. & Nishimori, H. (1997) On-line AdaTron learning of unlcarnable rules. Physical Review 
E 55:4544-4551. 

Inoue, J.I., Nishimori, H. & Kabashima, Y. (1997) A simple perceptron that learns non-monotonic 
rules, cond-mat/9708096. 

Miyoshi, S., Hara, K. & Okada, M. (2004) Analysis of ensemble learning for committee ma- 
chine teacher. Proc. The Seventh Workshop on Information-Based Induction Sciences: 178-185. 
[in Japanese] 

Miyoshi, S., Hara, K. & Okada, M. (2005) Analysis of ensemble learning for non-monotonic teacher. 
IEICE Technical Report NC2004-214:123-128. [in Japanese] 

Dekel, O. & Singer, Y. (2005) Data-driven online to batch conversions. In G. Tesauro, D. S. Touret- 
zky and T.K. Leen (eds.), Advances in Neural Information Processing Systems 18. 
Freund, Y. & Shapire, R.E. (1999) Large margin classification using the perceptron algorithm. 
Machine Learning 37(3):277-296. 

Cesa-Bianchi, N., Conconi, A. & Gnetile, C. (2004) On the generalization ability of on-line learning 
algorithms. IEEE Transactions on Information Theory 5O(9):2050-2057. 

Cesa-Bianchi, N. & Gnetile, C. (2005) Improved risk tail bounds for on-line algorithms. In G. 
Tesauro, D. S. Touretzky and T.K. Leen (eds.), Advances in Neural Information Processing Systems 
18. 

Nishimori, H. (2001) Statistical Physics of Spin Glasses and Information Processing: An Introduc- 
tion. Oxford: Oxford University Press. 



11/11